ABSTRACT

We consider the problem of learning a certain type of lexical semantic
knowledge that can be expressed as a binary relation between words, such
as the so-called sub-categorization of verbs (a verb-noun relation) and
the compound noun phrase relation (a noun-noun relation). Specifically,
we view this problem as an on-line learning problem in the sense of
Littlestone's learning model [Lit88] in which the learner's goal is to
minimize the total number of prediction mistakes. In the computational
learning theory literature, Goldman, Rivest and Schapire [GRS93] and
subsequently Goldman and Warmuth [GW93] have considered the on-line
learning problem for binary relations $R : X \times Y \to \{0, 1\}$ in
which one of the domain sets X can be partitioned into a relatively small
number of types, namely clusters consisting of behaviorally
indistinguishable members of X. In this paper, we extend this model and
suppose that both of the sets X, Y can be partitioned into a small number
of types, and propose a host of prediction algorithms which are
two-dimensional extensions of Goldman and Warmuth's weighted majority
type algorithm proposed for the original model. We apply these algorithms
to the learning problem for the 'compound noun phrase' relation, in which
a noun is related to another just in case they can form a noun phrase
together. Our experimental results show that all of our algorithms
out-perform Goldman and Warmuth's algorithm. We also theoretically
analyze the performance of one of our algorithms, in the form of an upper
bound on the worst case number of prediction mistakes it makes.

1 Introduction

A major obstacle that needs to be overcome for the
realization of a high quality natural language pro- 
cessing system is the problem of ambiguity resolu- 
tion. It is generally acknowledged that some form 
of semantic knowledge is necessary for a successful 
solution to this problem. In particular, the so-called 
sub-categorization of verbs is considered essential, 
which asks which verbs can take which nouns as 
a subject, a direct object, or as any other gram- 
matical role. A related form of knowledge is that 
of which nouns are likely to form compound noun 
phrases with which other nouns. These simple types 
of semantic knowledge can be expressed as a bi- 
nary relation, or more in general an n-ary relation, 
between words. Since inputting such knowledge by
hand is prohibitively expensive, automatic acquisition
of such knowledge from large corpus data has
become a topic of active research in natural language
processing (cf. [PTL92, Per94]).

In the computational learning theory literature, the problem of learning
binary relations has been considered by Goldman et al. [GRS93, GW93], in
the on-line learning model of Littlestone [Lit88] and various extensions
thereof. Note that a binary relation R between sets X and Y can be
thought of as a concept over the Cartesian product $X \times Y$, or a
function from $X \times Y$ to {0, 1} defined by R(x, y) = 1 if and only
if R holds between $x \in X$ and $y \in Y$. Thus, Littlestone's on-line
learning model for concepts can be directly adopted. Such a function can
also be thought of as a matrix having value R(x, y) at row x and column
y. Goldman et al. assumed that the rows can be partitioned into a
relatively small number of 'types', where any two rows $x_1, x_2 \in X$
are said to be of the same type if they are behaviorally
indistinguishable, i.e. $R(x_1, y) = R(x_2, y)$ for all $y \in Y$.
This is a natural assumption in our current prob- 
lem setting, as indeed similar nouns such as 'man' 
and 'woman' seem to be indistinguishable with re- 
gard, for example, to the subject-verb relation. Un- 
der this assumption, the learning problem can be 
basically identified with the problem of discovering 
the proper clustering of nouns in an on-line fashion. 
Indeed the weighted majority type algorithm pro- 
posed by Goldman and Warmuth for this problem 
fits this intuition. (This is the algorithm 'Learn-Relation(0)' in
[GW93], but in this paper we refer to it as WMP0.) Their algorithm keeps
a 'weight' $w(x_1, x_2)$ representing the believed degree of similarity
for any pair $x_1, x_2 \in X$, and at each trial predicts the label
R(x, y) by weighted majority vote among all $x' \in X$ such that it has
already seen the correct label R(x', y), each weighted according to
w(x, x').
The weights are multiplicatively updated each time 
a mistake is made, reflecting whether x' contributed 
positively or negatively to the correct prediction. 

The above algorithm takes advantage of the simi- 
larities that exist within X, but does not make use 
of similarities that may exist within Y. In our current
scenario, this may incur a significant loss. In
the subject-verb relation, not only the nouns but 
the verbs can also be classified into types. For ex- 
ample, the verbs 'eat' and 'drink' are sufficiently 
similar that they basically allow the same set of 
nouns as their subject. Motivated by this observation,
in this paper we propose extensions of WMP0,
called 2-dimensional weighted majority prediction
algorithms, which take advantage of the similarities 
that exist in both X and Y . 

We propose two basic variants of 2-dimensional weighted majority
prediction algorithms, WMP1 and WMP2. Both of these algorithms make use
of a weight $u(x_1, x_2)$ for each pair $x_1, x_2 \in X$ (called the 'row
weights') and a weight $v(y_1, y_2)$ for each pair $y_1, y_2 \in Y$
(called the 'column weights'). WMP1 makes the prediction on input
$(x, y) \in X \times Y$ by weighted majority vote over all past examples,
with each pair weighted by the product of the corresponding row weight
and column weight. It can thus make a rational prediction on a new pair
(i, j) even if both i and j are unseen in the past. The row weights are
updated trusting the column weights and vice versa. That is, after a
prediction mistake occurs on (i, j), each row weight u(i, i') is
multiplied by (see footnote 1) the ratio between the sum of all column
weights v(j, j') for the columns j' such that M(i', j') contributed to
the correct prediction for (i, j) and the sum of v(j, j') for the columns
contributing to the wrong prediction. The more conservative of our two
variants, WMP2, makes its predictions by majority vote over only the past
examples in either the same row or in the same column as the current pair
to be predicted. The weights are updated in a way similar to the update
rule used in WMP0. We also use the following combination of these two
algorithms, called WMP3. WMP3 predicts using the prediction method of
WMP1, but updates its weights using the more conservative update rule of
WMP2.

We apply all of these algorithms to on-line learn- 
ing of lexical semantic knowledge, in particular to 
the problem of learning the 'compound noun phrase' 
relation, namely the binary relation between nouns 
in which a noun is related to another just in case 
they can together form a compound noun phrase. 
We extracted two-word compound noun phrases 
from a large 'tagged' corpus, and used them as 
training data for learning the relation restricted on 
those nouns that appear sufficiently frequently in 
the corpus (see footnote 2). Our experimental results indicate that
our algorithms outperform WMP0 using weights on
X x X, which we call WMP0(X), and WMP0 with
weights on Y x Y, called WMP0(Y), as well as the
weighted majority algorithm (exactly in the sense of
[LW89]) which we call WMP4, using WMP0(X) and
WMP0(Y) as sub-routines. These results also show
that based on just 100 to 200 examples (representing
5 to 10 per cent of the entire domain) our algorithms
achieve about 80 to 85 per cent prediction accuracy
on an unknown input.

We also theoretically analyze the performance of one 
of our algorithms. In particular, we give an upper 
bound on the worst-case number of mistakes made 
by WMP2 on any sequence of trials, in Littlestone's 
on-line learning model. The bound we obtain is

$$\frac{1}{k+l}\left( kl(m+n) + (ln+km)\sqrt{2(m+n)\log\frac{kl(n+m)}{ln+km}} \right),$$

where n = |X|, m = |Y|, k is the number of row types, and l is the
number of column types. We note that this bound looks roughly like the
weighted average of the bound shown by Goldman and Warmuth for WMP0(X),
$km + n\sqrt{3m\log k}$, and that for WMP0(Y), $ln + m\sqrt{3n\log l}$,
and thus tends to fall in between them.

1 More precisely, we 'clamp' this ratio between two
constants, such as 0.5 and 2.0.

2 Note that in any corpus data there are only positive
examples, whereas the algorithms we propose here
require the use of both positive and negative examples.
We describe in Section 4 how we generate both positive
and negative examples from a given corpus.

Finally, we tested all of our learning algorithms on randomly generated
data for an artificially constructed target relation. The results of this
experimentation confirm the tendency of our earlier experiment that WMP1,
WMP2 and WMP3 outperform all of WMP0(X), WMP0(Y), and their weighted
majority WMP4, apparently contradicting the above-mentioned theoretical
findings. Our interpretation of these results is that although it is
difficult to establish, in terms of the worst case mistake bounds, that
our algorithms outperform the 1-dimensional algorithms, in practice they
seem to do better.

2 On-line Learning Model for 
Binary Relations 

As noted in the Introduction, a binary relation R between
sets X and Y is a concept over X x Y, or equivalently
a function from X x Y to {0, 1} defined by
R(x, y) = 1 if and only if R holds between x and y.
In general, a learning problem can be identified with 
a subclass of the class of all concepts over a given 
domain. In this paper, we consider the subclass of 
all binary relations defined over finite sets X x Y, in 
which both X and Y are classified into a relatively 
small number of 'types.' Formally, we say that a
binary relation R over X x Y is a (k, l)-relation, if
there are at most k row types and l column types,
namely R satisfies the following conditions.

• There exists a partition $\mathcal{P} = \{P_i \subseteq X : i = 1, \ldots, k\}$ of X such that
  $\forall P_i\ (i = 1, \ldots, k)\ \forall x_1, x_2 \in P_i\ \forall y \in Y\ [R(x_1, y) = R(x_2, y)]$.

• There exists a partition $\mathcal{Q} = \{Q_j \subseteq Y : j = 1, \ldots, l\}$ of Y such that
  $\forall Q_j\ (j = 1, \ldots, l)\ \forall y_1, y_2 \in Q_j\ \forall x \in X\ [R(x, y_1) = R(x, y_2)]$.
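
The two conditions above amount to saying that the matrix of R has at most k distinct rows and at most l distinct columns. The following Python sketch checks this on a toy matrix; the function names and the toy relation are illustrative assumptions, not part of the original formulation.

```python
def count_types(matrix):
    """Return (number of distinct rows, number of distinct columns)."""
    rows = {tuple(row) for row in matrix}
    cols = {tuple(col) for col in zip(*matrix)}
    return len(rows), len(cols)


def is_kl_relation(matrix, k, l):
    """True if the 0/1 matrix has at most k row types and l column types."""
    row_types, col_types = count_types(matrix)
    return row_types <= k and col_types <= l


# Toy relation (hypothetical): rows 0 and 1 behave identically,
# and so do columns 0 and 2.
R = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
print(count_types(R))           # (2, 2)
print(is_kl_relation(R, 2, 2))  # True
```

Grouping rows by equality gives the coarsest valid partition, so counting distinct rows and columns is enough to decide membership in the class.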

Next, we describe the on-line learning model for bi- 
nary relations. A learning session in this model con- 
sists of a sequence of trials. At each trial the learner 
is asked to predict the label of a previously unseen 
pair $(x, y) \in X \times Y$ based on the past examples.
The learner is then presented with the correct label
R(x, y) as reinforcement. A learner is therefore a
function that maps any finite sequence of labeled examples
and a pair from X x Y to a prediction value,
0 or 1. A learner's performance is measured in terms
of the total number of prediction mistakes it makes
in the worst case over all possible instance sequences
exhausting the entire domain, i.e. X x Y. When the
total number of mistakes made by a learning algorithm,
when learning a target relation belonging to
a given class, is always bounded above by a certain
function of various parameters quantifying the complexity
of the learning problem, such as |X|, |Y|, k
and l, then we say that that function is a mistake
bound for that algorithm and that class.
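
The protocol just described can be sketched as a short Python loop. The `predict`/`update` learner interface and the `AlwaysZero` baseline are hypothetical names introduced only for illustration, and a single random ordering is of course just one instance sequence, not the worst case that a mistake bound refers to.

```python
import random


def run_session(relation, learner, seed=0):
    """Present every pair of X x Y once, in random order, and count mistakes.

    relation[i][j] is the true label R(i, j). `learner` is any object with
    predict(i, j) -> 0 or 1 and update(i, j, label) (assumed interface).
    """
    pairs = [(i, j) for i in range(len(relation)) for j in range(len(relation[0]))]
    random.Random(seed).shuffle(pairs)
    mistakes = 0
    for i, j in pairs:
        guess = learner.predict(i, j)   # prediction is made before the label is seen
        label = relation[i][j]          # correct label, given as reinforcement
        mistakes += int(guess != label)
        learner.update(i, j, label)
    return mistakes


class AlwaysZero:
    """Trivial baseline learner that ignores all feedback."""

    def predict(self, i, j):
        return 0

    def update(self, i, j, label):
        pass


R = [[1, 0, 1], [1, 0, 1], [0, 1, 0]]
print(run_session(R, AlwaysZero()))  # 5, the number of 1-entries in R
```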

3 Two-dimensional Weighted
Majority Prediction Algorithms

In this section, we give the details of all variants of 2-dimensional WMP
algorithms informally described in the Introduction, as well as the
original 1-dimensional WMP algorithm of [GW93]. In the algorithm
descriptions to follow, we use the following notation. We let R denote
the target relation to be learned, and R(i, j) its label for (i, j). We
let M denote the 'observation matrix' obtained from the past trials. That
is, M(i, j) = 1 (or M(i, j) = 0) just in case R(i, j) = 1 (or
R(i, j) = 0) has been observed in the past, and M(i, j) = ? indicates
that (i, j) has not been seen so far. When we write
$M(i, j) \neq R(i', j')$, we mean that $M(i, j) \neq\ ?$ and
$M(i, j) \neq R(i', j')$. Finally, we use WMP0(X) to denote WMP0 using
weights between pairs of members of X, and WMP0(Y) to denote WMP0 using
weights between pairs of members of Y.

Algorithm WMP0(X) [GW93]
(1-dimensional weighted majority prediction)
Initialize all weights w(i, i') to 1
Do Until No more pairs are left to predict
  Get a new pair (i, j) and predict R(i, j) as follows:
  If $\sum_{i' : M(i',j)=1} w(i, i') > \sum_{i' : M(i',j)=0} w(i, i')$
    then predict R(i, j) = 1
    else predict R(i, j) = 0
  Get the correct label R(i, j)
  If a prediction mistake is made
    then for all i' such that $M(i', j) = R(i, j)$
      $w(i, i') := (2 - \gamma) \cdot w(i, i')$
    and for all i' such that $M(i', j) \neq R(i, j)$
      $w(i, i') := \gamma \cdot w(i, i')$
End Do
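
As a rough illustration of WMP0(X), the following Python sketch keeps one weight per pair of rows and votes within the current column. The class name, the use of `None` for '?', the symmetric storage of weights, and the value gamma = 0.5 in the toy session are all assumptions made for this example.

```python
class WMP0X:
    """Sketch of WMP0(X): row-pair weights, weighted vote within column j."""

    def __init__(self, n_rows, n_cols, gamma=0.0):
        self.gamma = gamma
        self.w = [[1.0] * n_rows for _ in range(n_rows)]
        self.M = [[None] * n_cols for _ in range(n_rows)]  # None plays the role of '?'

    def predict(self, i, j):
        votes = {0: 0.0, 1: 0.0}
        for i2, row in enumerate(self.M):
            if row[j] is not None:
                votes[row[j]] += self.w[i][i2]
        return 1 if votes[1] > votes[0] else 0

    def update(self, i, j, label):
        if self.predict(i, j) != label:            # weights change only on mistakes
            for i2, row in enumerate(self.M):
                if row[j] is None:
                    continue
                factor = (2 - self.gamma) if row[j] == label else self.gamma
                self.w[i][i2] *= factor
                self.w[i2][i] = self.w[i][i2]      # assumption: weights are symmetric
        self.M[i][j] = label


# Toy session; rows 0 and 1 agree in column 0, row 2 disagrees with both.
learner = WMP0X(n_rows=3, n_cols=2, gamma=0.5)
for i, j, label in [(0, 0, 1), (0, 1, 0), (1, 0, 1), (2, 0, 0)]:
    learner.update(i, j, label)
print(learner.w[2][0], learner.w[2][1])  # 0.5 0.5
```

After the mistake on (2, 0), the weights linking row 2 to rows 0 and 1 are demoted, so those rows count for less in later votes for row 2.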

Algorithm WMP1
(weighted majority over all past examples)
For all i, j, $u(i, i) := u_{init}$; $v(j, j) := v_{init}$
Initialize all other weights to 1
Do Until No more pairs are left to predict
  Get a new pair (i, j) and predict R(i, j) as follows:
  If $\sum_{(i',j') : M(i',j')=1} u(i, i') \cdot v(j, j') > \sum_{(i',j') : M(i',j')=0} u(i, i') \cdot v(j, j')$
    then predict R(i, j) = 1
    else predict R(i, j) = 0
  Get the correct label R(i, j)
  If a prediction mistake is made
    then for all i', j' update weights as follows
      $u^* := \max\left\{u_{low}, \min\left\{u_{up}, \frac{\sum_{j' : M(i',j') = R(i,j)} v(j, j')}{\sum_{j' : M(i',j') \neq R(i,j)} v(j, j')}\right\}\right\}$
      $u(i, i') := u(i, i') \cdot u^*$
      $v^* := \max\left\{v_{low}, \min\left\{v_{up}, \frac{\sum_{i' : M(i',j') = R(i,j)} u(i, i')}{\sum_{i' : M(i',j') \neq R(i,j)} u(i, i')}\right\}\right\}$
      $v(j, j') := v(j, j') \cdot v^*$
    For all i, $u(i, i) := \max\{u_{init}, [\ldots] \cdot u(i, i)\}$
    For all j, $v(j, j) := \max\{v_{init}, [\ldots] \cdot v(j, j)\}$
End Do
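
The prediction rule of WMP1 and the clamped ratio used in its row-weight update can be sketched in Python as follows. This is only a partial sketch under stated assumptions: the function names are hypothetical, the treatment of an empty denominator is our own choice (the description does not say what happens then), and the final step that restores the diagonal weights is omitted.

```python
def wmp1_predict(M, u, v, i, j):
    """Vote over all observed pairs (i2, j2), weighted by u[i][i2] * v[j][j2]."""
    votes = {0: 0.0, 1: 0.0}
    for i2, row in enumerate(M):
        for j2, label in enumerate(row):
            if label is not None:            # None plays the role of '?'
                votes[label] += u[i][i2] * v[j][j2]
    return 1 if votes[1] > votes[0] else 0


def row_multiplier(M, v, i2, j, label, u_low=0.5, u_up=2.0):
    """u* for row i2: clamped ratio of agreeing to disagreeing column weight."""
    agree = sum(v[j][j2] for j2, m in enumerate(M[i2]) if m == label)
    disagree = sum(v[j][j2] for j2, m in enumerate(M[i2])
                   if m is not None and m != label)
    if disagree == 0:
        # Assumption: an infinite ratio is clamped to u_up; 0/0 leaves the weight alone.
        return u_up if agree > 0 else 1.0
    return max(u_low, min(u_up, agree / disagree))


# Toy state: only R(0, 0) = 1 has been observed.
M = [[1, None], [None, None]]
u = [[10.0, 1.0], [1.0, 10.0]]   # diagonal starts at u_init = 10
v = [[10.0, 1.0], [1.0, 10.0]]   # diagonal starts at v_init = 10
print(wmp1_predict(M, u, v, 1, 1))               # 1, although row 1 and column 1 are unseen
print(row_multiplier(M, v, i2=0, j=1, label=0))  # 0.5
```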

Algorithm WMP2
(weighted majority over same row and column)
Initialize all weights to 1
Do Until No more pairs are left to predict
  Get a new pair (i, j) and predict R(i, j) as follows:
  If $\sum_{i' : M(i',j)=1} u(i, i') + \sum_{j' : M(i,j')=1} v(j, j') > \sum_{i' : M(i',j)=0} u(i, i') + \sum_{j' : M(i,j')=0} v(j, j')$
    then predict R(i, j) = 1
    else predict R(i, j) = 0
  Get the correct label R(i, j)
  If a prediction mistake is made
    then for all i', j' update weights as follows
      If $M(i', j) = R(i, j)$ then $u(i, i') := (2 - \gamma) \cdot u(i, i')$
      else if $M(i', j) \neq R(i, j)$ then $u(i, i') := \gamma \cdot u(i, i')$
      If $M(i, j') = R(i, j)$ then $v(j, j') := (2 - \gamma) \cdot v(j, j')$
      else if $M(i, j') \neq R(i, j)$ then $v(j, j') := \gamma \cdot v(j, j')$
End Do
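
A minimal Python sketch of WMP2 follows, under the same assumptions as before: the class name, `None` for '?', symmetric weight storage, and gamma = 0.5 in the toy session are illustrative choices rather than details taken from the algorithm description.

```python
class WMP2:
    """Sketch of WMP2: vote over known entries in row i and column j only."""

    def __init__(self, n_rows, n_cols, gamma=0.0):
        self.gamma = gamma
        self.u = [[1.0] * n_rows for _ in range(n_rows)]   # row weights
        self.v = [[1.0] * n_cols for _ in range(n_cols)]   # column weights
        self.M = [[None] * n_cols for _ in range(n_rows)]  # None plays the role of '?'

    def predict(self, i, j):
        votes = {0: 0.0, 1: 0.0}
        for i2 in range(len(self.M)):            # entries in the same column j
            if self.M[i2][j] is not None:
                votes[self.M[i2][j]] += self.u[i][i2]
        for j2 in range(len(self.M[i])):         # entries in the same row i
            if self.M[i][j2] is not None:
                votes[self.M[i][j2]] += self.v[j][j2]
        return 1 if votes[1] > votes[0] else 0

    def update(self, i, j, label):
        if self.predict(i, j) != label:          # weights change only on mistakes
            for i2 in range(len(self.M)):
                m = self.M[i2][j]
                if m is not None:
                    self.u[i][i2] *= (2 - self.gamma) if m == label else self.gamma
                    self.u[i2][i] = self.u[i][i2]   # assumption: symmetric weights
            for j2 in range(len(self.M[i])):
                m = self.M[i][j2]
                if m is not None:
                    self.v[j][j2] *= (2 - self.gamma) if m == label else self.gamma
                    self.v[j2][j] = self.v[j][j2]
        self.M[i][j] = label


learner = WMP2(n_rows=2, n_cols=2, gamma=0.5)
for i, j, label in [(0, 0, 1), (0, 1, 0), (1, 0, 1)]:
    learner.update(i, j, label)
print(learner.v[1][0])        # 0.5: columns 0 and 1 disagreed in row 0
print(learner.predict(1, 1))  # 0: the row vote (weight 1) beats the column vote (0.5)
```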

Algorithm WMP3

(mixed strategy between WMP1 and WMP2) 

Initialize all weights to 1 

Do Until No more pairs are left to predict 
Predict with the prediction rule of WMP1 
Update the weights by the update rule of WMP2 

End Do 

Algorithm WMP4
(weighted majority over WMP0(X) and WMP0(Y))
Initialize weights $w_1$ and $w_2$ to 1
Do Until No more pairs are left to predict
  Get a new pair (i, j) and predict R(i, j) as follows:
  If $\frac{w_1 \cdot WMP0(X) + w_2 \cdot WMP0(Y)}{w_1 + w_2} > \frac{1}{2}$
    then predict R(i, j) = 1
    else predict R(i, j) = 0
  Get the correct label R(i, j) and update weights as follows
  If $R(i, j) \neq WMP0(X)$
    then $w_1 := \beta w_1$ and update weights of WMP0(X) according to WMP0
  If $R(i, j) \neq WMP0(Y)$
    then $w_2 := \beta w_2$ and update weights of WMP0(Y) according to WMP0
End Do

In the above description of WMP1, $u_{up}$, $u_{low}$, $v_{up}$ and
$v_{low}$ are any reals satisfying $u_{up} > 1$, $u_{low} < 1$,
$v_{up} > 1$ and $v_{low} < 1$, but we set $u_{up} = v_{up} = 2$ and
$u_{low} = v_{low} = \frac{1}{2}$ in our experiments. We set
$u_{init} = v_{init} = 10$ in our experiments. In WMP0 and WMP2, we set
$\gamma = \frac{2\beta}{1+\beta}$ for some $\beta \in [0, 1)$, so that we
have $\gamma/(2 - \gamma) = \beta$. In our experiments (see footnote 3),
we used $\beta = [\ldots]$. Finally, in WMP4, $\beta$ can be any real
number in the range (0, 1), but in our experiments we set
$\beta = \frac{1}{2}$.
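
The relation between the two parameters can be checked by solving the stated condition for the demotion factor. The following short derivation is a worked step; the numerical value of beta used at the end is purely illustrative.

```latex
\[
\frac{\gamma}{2-\gamma} = \beta
\;\Longleftrightarrow\; \gamma = \beta\,(2-\gamma)
\;\Longleftrightarrow\; \gamma\,(1+\beta) = 2\beta
\;\Longleftrightarrow\; \gamma = \frac{2\beta}{1+\beta}.
\]
% Illustrative value (an assumption, not a reported setting): \beta = 1/2.
\[
\beta = \tfrac{1}{2}
\;\Longrightarrow\;
\gamma = \tfrac{2}{3},\qquad 2-\gamma = \tfrac{4}{3},\qquad
\frac{\gamma}{2-\gamma} = \tfrac{1}{2}.
\]
```

In particular, beta = 0 gives gamma = 0, so disagreeing rows or columns are eliminated outright, while values of beta closer to 1 make the update more forgiving of exceptions.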

4 Experimental Results 

4.1 Learning Lexical Semantic Knowledge 

We performed experiments on the problem of learn- 
ing the 'compound noun phrase' relations. As train- 
ing data, we used two-word compound noun phrases 
extracted from a large tagged corpus. The problem 
here is that although our learning algorithms make 
use of positive and negative examples, only positive 
examples are directly available in any corpus data. 
To solve this problem, we make use of the notion 
of 'association ratio,' which has been proposed and 
used by Church and Hanks [CH89] in the context 
of 'corpus-based' natural language processing. The 
association ratio between x and y quantifies the likelihood
of co-occurrence of x and y, and is defined as
follows. (All logarithms are to the base 2 in this paper.)

$$I(x, y) = \log \frac{P(x, y)}{P(x) P(y)}$$

We write P(x), P(y) for the respective occurrence
probability for x and y, and P(x, y) for the co-occurrence
probability of x and y. In the actual
experiments, we used pairs of nouns with association
ratio greater than 0.5 as positive examples, and
those with association ratio less than -4.5 as negative
examples.

3 When we use WMP0 or WMP2 to predict a target
relation which is 'pure' in the sense [GW93] that it is
exactly a (k, l)-binary relation for some small k and l,
we can let $\beta = 0$. In practice, however, it is likely that
the target relation is almost a (k, l)-binary relation with
a few exceptions. When learning such a relation, setting
$\beta = 0$ is too risky and it is better to use a more
conservative setting, such as $\beta = [\ldots]$.

We now give a detailed description of our exper- 
iments. We extracted approximately 80,000 two- 
word noun phrases from the Penn Tree Bank tagged 
corpus consisting of 120,000 sentences. We then per- 
formed our learning experiments focusing on the 53 
most frequently appearing nouns on the left and the 
40 most frequently appearing nouns on the right. We 
show the entire lists of these nouns in Figures 1 and 
2. We then obtained positive and negative examples 
for these 53 x 40 pairs of nouns listed above from the 
corpus using association ratio, as described earlier in 
this section. There were 512 of these. Figure 3 shows 
several of these examples chosen arbitrarily from the 
512 examples, paired with their association ratios. 
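
The labeling step can be illustrated with a small Python sketch, using the association ratio of [CH89] estimated from raw counts and the thresholds 0.5 and -4.5 stated above. The function names and the counts in the usage line are hypothetical, and the sketch assumes a positive co-occurrence count; how pairs that never co-occur are scored is not specified here.

```python
import math


def association_ratio(count_xy, count_x, count_y, total):
    """log2( P(x, y) / (P(x) P(y)) ), with probabilities estimated as count / total."""
    # Algebraically equal to log2((count_xy/total) / ((count_x/total) * (count_y/total))).
    return math.log2(count_xy * total / (count_x * count_y))


def label_pair(ratio, pos_threshold=0.5, neg_threshold=-4.5):
    """1 for a positive example, 0 for a negative one, None if the pair is not used."""
    if ratio > pos_threshold:
        return 1
    if ratio < neg_threshold:
        return 0
    return None


# Hypothetical counts: x occurs 100 times, y 50 times, together 20 times in 1000 phrases.
print(association_ratio(20, 100, 50, 1000))  # 2.0
print(label_pair(5.032259))    # 1, e.g. 'capital gain' in Figure 3
print(label_pair(-4.769645))   # 0, e.g. 'market manager' in Figure 3
```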

In our experiments, we evaluated various predic- 
tion algorithms by the number of prediction mis- 
takes they make on the training data obtained in 
the manner just described. More specifically, using 
a random number generator, we obtained ten dis- 
tinct random permutations of the 512 training data, 
and we tested and compared the number of predic- 
tion mistakes made by WMP1 through WMP4 as 
well as WMP0. 

The results of this experiment are shown in Figure 4. 
Figure 4(a) shows how the cumulative prediction accuracy,
i.e. the number of correct predictions made up to that
point divided by the number of trials, changes at various
stages of a learning session, averaged over the
ten sessions. Figure 4(b), on the other hand, plots 
(the approximation of) the instantaneous prediction 
accuracy achieved at various stages in a learning ses- 
sion, again averaged over the ten sessions. More pre- 
cisely, the value plotted at each trial is the average 
percentage of correct predictions in the last 50 trials 
(leading up to the trial in question). 
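
The two curves can be computed from a per-trial record of correct and incorrect predictions, as in the following Python sketch. The function names are illustrative, the handling of the first trials (when fewer than 50 are available) is an assumption, and the averaging over ten sessions is left out.

```python
def cumulative_accuracy(correct):
    """correct[t] is 1 if trial t was predicted correctly, else 0."""
    out, hits = [], 0
    for t, c in enumerate(correct, start=1):
        hits += c
        out.append(hits / t)
    return out


def windowed_accuracy(correct, window=50):
    """Fraction correct over the last `window` trials (fewer at the start)."""
    out = []
    for t in range(1, len(correct) + 1):
        recent = correct[max(0, t - window):t]
        out.append(sum(recent) / len(recent))
    return out


record = [0, 1, 1, 1]
print(cumulative_accuracy(record))          # starts at 0.0 and ends at 0.75
print(windowed_accuracy(record, window=2))  # [0.0, 0.5, 1.0, 1.0]
```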

Inspecting these experimental results reveals a cer- 
tain definite tendency. That is, with respect to both 
the cumulative prediction accuracy (or equivalently 
the total number of prediction mistakes made),
and the 'instantaneous' prediction accuracy, all of
the algorithms we propose outperform WMP0(X), 
WMP0(Y) and their weighted majority. It is worth 
noting that the instantaneous prediction accuracy 
achieved by our algorithms after 100 trials is already 
about 80 per cent and after 200 trials reaches about 
85 per cent, and then levels off. This seems to indi- 
cate that after seeing only 5 to 10 per cent of the en- 
tire domain, they achieve the level of generalization 
that is close to the best possible for this particular 
problem, which we suspect is quite noisy. 

Examining the final settings of the weights, it did not 
appear as if our learning algorithms were discover- 
ing very clear clusters. Moreover, the final weight 
settings of WMP1 and WMP2 were not particu- 
larly correlated, even though their predictive perfor- 
mances were roughly equal. In Figure 5, we exhibit 
the final settings of the column weights in WMP1 
between the noun 'stock' and some of the other col- 
umn nouns, sorted in the decreasing order. Perhaps 
it makes sense that the weight between 'stock' and 
'maker' is set small, for example, but in general it 
is hard to say that a proper clustering has been dis- 
covered. Interestingly, however, its predictive per- 
formance is quite satisfactory. 

We feel that these results are rather encouraging,
considering (i) that the target relation is most likely
not a pure (k, l)-relation for reasonably small k and
l, and (ii) that among the nouns that were used in
this experiment, there are not so many 'related' ones,
since we chose the 40 (or 53) most frequently occurring
nouns in a given corpus.

4.2 Simulation Experiments with 
Artificially Generated Data 

Figure 1: Nouns on the left hand side

Figure 2: Nouns on the right hand side

We performed controlled experiments in which we tested all of our
algorithms on artificially generated data. We used as the target relation
a 'pure relation' defined over a domain of a comparable size to our
earlier experiment (40 x 50), having 4 row types and 5 column types. In
other words, the parameter settings we chose are n = 40, m = 50, k = 4,
and l = 5. Each row and column type was equally sized (at 10). We tested
our algorithms, plus WMP0(X), WMP0(Y), and WMP4 on ten randomly generated
complete trial sequences, namely sequences of length 40 x 50. As before,
Figure 6(a) shows the cumulative prediction accuracy (at various stages
of a learning session) averaged over ten learning sessions, and Figure
6(b) plots the average approximate instantaneous prediction accuracy,
calculated using the 50 most recent trials at each trial.
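
A target relation and trial sequence of this shape could be generated along the following lines. This Python sketch is an assumption-laden illustration: the text does not say how the label of each (row type, column type) block was chosen, so the sketch draws them uniformly at random, and the function names are hypothetical.

```python
import random


def make_pure_relation(n=40, m=50, k=4, l=5, seed=0):
    """0/1 matrix with k equally sized row types and l equally sized column types."""
    rng = random.Random(seed)
    # Assumption: one random label per (row type, column type) block.
    block = [[rng.randint(0, 1) for _ in range(l)] for _ in range(k)]
    row_type = [i * k // n for i in range(n)]   # rows 0-9 -> type 0, 10-19 -> type 1, ...
    col_type = [j * l // m for j in range(m)]
    return [[block[row_type[i]][col_type[j]] for j in range(m)] for i in range(n)]


def make_trial_sequence(n=40, m=50, seed=0):
    """A complete trial sequence: every pair of the domain exactly once."""
    pairs = [(i, j) for i in range(n) for j in range(m)]
    random.Random(seed).shuffle(pairs)
    return pairs


R = make_pure_relation()
trials = make_trial_sequence()
print(len(R), len(R[0]), len(trials))  # 40 50 2000
```

Because blocks are drawn at random, two row types may happen to coincide; the result then has fewer than k distinct row types but is still a (k, l)-relation in the sense of Section 2.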

These results seem to indicate that, at least for pure
relations with a reasonable number of types, all our
algorithms, WMP1, WMP2 and WMP3, outperform
WMP0(X), WMP0(Y) and their weighted majority,
confirming the tendency observed in our earlier ex- 
periment on lexical semantic knowledge acquisition. 
Moreover, the learning curves obtained for the sim- 
ulation experiments are quite close to those for the 
earlier experiment. 

Our algorithms achieve about 93 per cent cumulative prediction accuracy
at the end of a learning session. This means that roughly
2000 x 0.07 = 140 mistakes were made in total. How does this compare with
theoretical bounds on the number of mistakes for these algorithms? In a
companion paper [NA95], it is shown that the worst case number of
mistakes for any algorithm learning a (k, l)-binary relation is at least
$kl + (n - k)\log k + (m - l)\log l$. Plugging in the values n = 40,
m = 50, k = 4, and l = 5, we obtain 216.1. So our algorithms seem to
perform in practice even better than the theoretically best possible
worst case behavior by any algorithm. In the next section, we show for
WMP2 the mistake bound

$$\frac{1}{k+l}\left( kl(m+n) + (ln+km)\sqrt{2(m+n)\log\frac{kl(n+m)}{ln+km}} \right)$$

which upon substitution of the concrete values becomes 907.76. The bounds
due to Goldman and Warmuth [GW93] on WMP0(X) and WMP0(Y),
$km + n\sqrt{3m\log k}$ and $ln + m\sqrt{3n\log l}$, come out to be 892.8
and 1034.6, respectively. Although these bounds all seem to be gross
over-estimates of the number of mistakes in the typical situation we have
here (see footnote 4), the tendency is clear. The bound for WMP2 is worse
than the better of the bounds for WMP0(X) and WMP0(Y). In our
experiments, this is not the case and our 2-dimensional extensions
out-perform both WMP0(X) and WMP0(Y). Our feeling is that this does not
necessarily mean that our mistake bound can be improved drastically.
Rather, these findings seem to call for a theoretical analysis of the
typical behavior of these algorithms, perhaps in some form of average
case analysis.
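
The two Goldman and Warmuth bounds quoted above can be reproduced with a few lines of Python, using base-2 logarithms as elsewhere in the paper. The helper name `gw_bound` and its argument names are introduced here only for illustration.

```python
import math


def gw_bound(num_types, n_clustered, n_other):
    """num_types * n_other + n_clustered * sqrt(3 * n_other * log2(num_types))."""
    return num_types * n_other + n_clustered * math.sqrt(
        3 * n_other * math.log2(num_types)
    )


n, m, k, l = 40, 50, 4, 5
bound_x = gw_bound(k, n, m)   # WMP0(X): km + n * sqrt(3m log k)
bound_y = gw_bound(l, m, n)   # WMP0(Y): ln + m * sqrt(3n log l)
print(round(bound_x, 1), round(bound_y, 1))  # 892.8 1034.6
```

Both values are several times larger than the roughly 140 mistakes observed in the simulation, which illustrates the gap between worst-case bounds and typical behavior discussed above.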

5 Theoretical Performance Analysis 

In this section, we prove the following mistake bound for WMP2. As we
noted in the Introduction, our upper bound looks roughly like the
weighted average of the bounds of [GW93] for WMP0(X) and WMP0(Y), and
thus tends to be in between the two bounds. We add that we have not been
able to prove a rigorous mistake bound for WMP1. We expect that in fact
no non-trivial worst case mistake bound for WMP1 exists.

4 It should be noted that these bounds become much
more sensible for larger values of n, m, k, l.

market      manager    -4.769645    ad          industry   1.838923    equity    issue      1.446904
production  growth      1.928957    future      contract   4.337400    vice      concern   -4.715181
capital     gain        5.032259    government  security   1.997249    company   official   4.345481
service     price      -4.651155    food        issue      1.028839    sale      growth     5.309577
mortgage    payment     2.601217    industry    market    -5.716859    equity    group      0.860507
auto        maker       3.871484    law         price     -4.998001    service   firm       0.562419
future      company    -4.999019    program     increase   0.976705    revenue   bond       3.955944
insurance   president  -5.614235    product     group      0.926659    exchange  president -5.099226

Figure 3: Part of the training data (left noun, right noun, association ratio)

Figure 4: (a) Average cumulative prediction accuracy and (b) Average instantaneous prediction accuracy

Theorem 5.1 When learning a (k, l)-binary relation,
Algorithm WMP2 makes at most

$$\frac{1}{k+l}\left( kl(m+n) + (ln+km)\sqrt{2(m+n)\log\frac{kl(n+m)}{ln+km}} \right)$$

mistakes in the worst case, provided k, l > 2.

(Proof) We need the following definitions and notation. Let $n_p$ denote
the number of rows of type p and let $m_q$ denote the number of columns
of type q. Let $\mu^r_p$ denote the number of mistakes made in row type
p, and let $\mu^c_q$ denote the number of mistakes made in column type q.
We then let $\mu$ denote the total number of mistakes, i.e.,
$\mu = \sum_{p=1}^{k} \mu^r_p = \sum_{q=1}^{l} \mu^c_q$. We write $E^r_p$
for the set of all edges between two rows of type
$p \in \{1, \ldots, k\}$, and $E^c_q$ for the set of all edges between
two columns of type $q \in \{1, \ldots, l\}$. We write $e^r_{i_1 i_2}$
for the edge between row $i_1$ and row $i_2$, and $e^c_{j_1 j_2}$ for the
edge between column $j_1$ and column $j_2$. Extending the notion of
'force' used in the proof of Theorem 4 in [GW93], for each prediction
mistake made, say in predicting (i, j), we define the row force of the
mistake to be the number of rows i' of the same type as i for which
R(i', j) was known at the time of the prediction. Let $F^r_p$ denote the
sum of the row forces of all mistakes made in row type p. We define the
column force of a mistake analogously, and let $F^c_q$ denote the sum of
column forces of mistakes made in column type q.

stock-index      2.0    stock-executive  1.0    stock-share     1.0    stock-security  0.74   stock-trader   0.59
stock-sale       0.55   stock-system     0.52   stock-contract  0.5    stock-industry  0.5    stock-analyst  0.5
stock-operation  0.5    stock-business   0.5    stock-bond      0.5    stock-value     0.5    stock-gain     0.5
stock-line       0.25   stock-maker      0.22

Figure 5: WMP1's weights between 'stock' and other column nouns

Figure 6: (a) Average cumulative prediction accuracy and (b) Average instantaneous prediction accuracy

The theorem is proved using the following two lemmas.

Lemma 5.1 For each $1 \le p \le k$, $1 \le q \le l$,

$$F^r_p + F^c_q \le (|E^r_p| + |E^c_q|) \log \frac{n(n-1) + m(m-1)}{2(|E^r_p| + |E^c_q|)}.$$

(Proof) The following two inequalities can be shown
in a similar manner to the proofs of Lemma 1 and
Lemma 2 in [GW93]:

$$F^r_p + F^c_q \le \sum_{e \in E^r_p \cup E^c_q} \log w(e),$$

$$\sum_{e \in E^r_p \cup E^c_q} w(e) \le \frac{n(n-1) + m(m-1)}{2}.$$

The lemma now follows easily from these two inequalities
and Jensen's inequality. □

The following analogues of inequality (4) in [GW93]
for the row and column forces can be readily shown.

Lemma 5.2 For each $1 \le p \le k$, $1 \le q \le l$, both of
the following hold.

$$F^r_p \ge \frac{(\mu^r_p - m)^2}{2m}, \qquad F^c_q \ge \frac{(\mu^c_q - n)^2}{2n}.$$

Now from Lemma 5.1 and Lemma 5.2 we obtain

$$\frac{(\mu^r_p - m)^2}{2m} + \frac{(\mu^c_q - n)^2}{2n} \le (|E^r_p| + |E^c_q|) \log \frac{n(n-1) + m(m-1)}{2(|E^r_p| + |E^c_q|)}.$$

Thus the following inequality follows.

$$(\mu^r_p - m) + (\mu^c_q - n) \le \sqrt{2(m+n)(|E^r_p| + |E^c_q|) \log \frac{n(n-1) + m(m-1)}{2(|E^r_p| + |E^c_q|)}}.$$

By summing the above over p and q we get

$$(k+l)\mu \le kl(m+n) + \sum_{p=1}^{k} \sum_{q=1}^{l} \sqrt{2(m+n)(|E^r_p| + |E^c_q|) \log \frac{n(n-1) + m(m-1)}{2(|E^r_p| + |E^c_q|)}}.$$

Since $f(x) = x\sqrt{\log \frac{c}{x^2}}$ is concave for constant c > 0,
the following can be shown to hold, where we let
$a = \sum_{p=1}^{k} \sum_{q=1}^{l} \sqrt{n_p(n_p - 1) + m_q(m_q - 1)}$.

$$(k+l)\mu \le kl(m+n) + \sqrt{m+n} \cdot kl \cdot \frac{a}{kl} \sqrt{\log \frac{n(n-1) + m(m-1)}{(a/kl)^2}}$$

$$= kl(m+n) + \sqrt{m+n} \cdot a \sqrt{\log \frac{k^2 l^2 (n(n-1) + m(m-1))}{a^2}}$$

$$\le kl(m+n) + \sqrt{m+n} \cdot a \sqrt{\log \frac{k^2 l^2 (n+m)^2}{a^2}}.$$

Since $a = \sum_{p=1}^{k} \sum_{q=1}^{l} \sqrt{n_p(n_p - 1) + m_q(m_q - 1)} \le \sum_{p=1}^{k} \sum_{q=1}^{l} (n_p + m_q) = ln + km$
and $f(x) = x\sqrt{\log \frac{c}{x^2}}$ is monotonically increasing for x in the
range $\frac{c}{x^2} > e$, for k, l > 2 we have

$$(k+l)\mu \le kl(m+n) + (ln+km)\sqrt{2(m+n) \log \frac{kl(n+m)}{ln+km}}.$$

The theorem follows immediately from this. □



□ 



6 Concluding Remarks 

We have presented 2-dimensional extensions of the 
weighted majority prediction algorithm of [GW93] 
for binary relations, and applied them to the prob- 
lem of learning the 'compound noun phrase' relation. 
A common approach to this problem in natural language
processing makes use of some a priori knowledge
about the noun clusters, usually in the form of
a thesaurus (cf. [Res92]). Our algorithms make no
use of such knowledge. Another common approach
is the statistical clustering approach (cf. [PTL92]),
which views the clustering problem as the maximum
likelihood estimation of a word co-occurrence
distribution. Such an approach is based on a sound
theory of statistics, but is often computationally
intractable as the clustering problem is NP-complete
even in the 1-dimensional case. Our formulation of
this problem as an on-line learning problem of de- 
terministic binary relations gives rise to algorithms 
that are especially simple and efficient. Our algo- 
rithms seem to somehow bypass having to explicitly 
solve the clustering problem, and yet achieve reason- 
ably high predictive performance. Note also that our 
upper bound on the worst case number of mistakes 
made by WMP2 relies on no probabilistic assumption
on the input data. In the future, we would like
to apply our algorithms to other related problems,
such as that of learning verb sub-categorization
relations.